The aim of this analysis is to investigate diabetes prevalence over time and by country/region. The purpose is to identify countries and years with high diabetes prevalence.
The dataset ‘DIABETES evolution of diabetes over time’ is a global dataset of diabetes prevalence from 1980 to 2014 and contains 14,000 observations and 7 variables:
# Read in Data
data_full <- read_csv("Data/Diabetes_data.csv")
## Rows: 14000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country/Region/World, ISO, Sex
## dbl (4): Year, Age-standardised diabetes prevalence, Lower 95% uncertainty i...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_full_head <- head(data_full)
kable(data_full_head,
caption = "Table 1: First Six Observations of the Full Diabetes Dataset",
digits = 2)
| Country/Region/World | ISO | Sex | Year | Age-standardised diabetes prevalence | Lower 95% uncertainty interval | Upper 95% uncertainty interval |
|---|---|---|---|---|---|---|
| Afghanistan | AFG | Men | 1980 | 0.04 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1981 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1982 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1983 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1984 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1985 | 0.05 | 0.02 | 0.09 |
The full dataset was reduced to 1000 observations by randomly generating row numbers and keeping the matching rows. The variable “ISO” was removed as it was not necessary for the analysis. **Figure 1** below shows the code used to tidy the full dataset into the reduced dataset.
include_graphics("Image/code_screenshot.png")
Figure 1: Code Screenshot of Data Tidying
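Because the tidying code is only available as a screenshot, the following is a minimal sketch of what such a step might look like, not the code in Figure 1. It assumes dplyr's `slice_sample()` for the random row selection and uses the reduced column names that appear later in the `str()` output. The toy data frame `toy_full`, its values and the seed are hypothetical stand-ins so that the example runs on its own.

```r
library(dplyr)

# Hypothetical seed, used only so the sketch is reproducible.
set.seed(1)

# Toy stand-in for data_full: same column names, made-up values.
toy_full <- tibble(
  `Country/Region/World` = rep(c("A", "B"), each = 10),
  ISO = rep(c("AAA", "BBB"), each = 10),
  Sex = rep(c("Men", "Women"), times = 10),
  Year = rep(1980:1989, times = 2),
  `Age-standardised diabetes prevalence` = runif(20, 0.02, 0.15),
  `Lower 95% uncertainty interval` = runif(20, 0.01, 0.02),
  `Upper 95% uncertainty interval` = runif(20, 0.15, 0.20)
)

toy_reduced <- toy_full %>%
  slice_sample(n = 5) %>%   # the report keeps 1000 rows; 5 is enough for the toy data
  select(-ISO) %>%          # drop the ISO code, which is not needed for analysis
  rename(
    diabetes_prevalence = `Age-standardised diabetes prevalence`,
    lower_95 = `Lower 95% uncertainty interval`,
    upper_95 = `Upper 95% uncertainty interval`
  )
```

Sampling rows directly avoids generating row numbers by hand, and setting a seed means the same reduced dataset is drawn each time the report is knitted.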
The function str() is applied to the first 2 rows of the data to show the type of each variable in the dataset (numeric, character/factor, etc.). The assessment requires a maximum of 5 variables, but both “lower_95” and “upper_95” were kept because together they define the 95% uncertainty interval.
head_data_2 <- head(data,2)
str(head_data_2)
## tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
## $ Country/Region/World: chr [1:2] "Djibouti" "Sierra Leone"
## $ Sex : chr [1:2] "Men" "Women"
## $ Year : num [1:2] 2012 2006
## $ diabetes_prevalence : num [1:2] 0.0859 0.0563
## $ lower_95 : num [1:2] 0.0447 0.0349
## $ upper_95 : num [1:2] 0.1438 0.0833
Two summary statistics, the mean and the standard deviation, were calculated for diabetes prevalence and its uncertainty bounds by “Year”. Table 2 shows the last 10 years of the results.
tail_data_summary <- tail(data_summary, 10)
kable(tail_data_summary,
caption = "Table 2: Mean and Standard Deviation of Diabetes Prevalence by Year (Last 10 Rows)",
digits = 3)
| Year | mean_diabetes | sd_diabetes | mean_upper95 | sd_upper95 | mean_lower95 | sd_lower95 |
|---|---|---|---|---|---|---|
| 2005 | 0.091 | 0.046 | 0.127 | 0.059 | 0.034 | 0.034 |
| 2006 | 0.083 | 0.039 | 0.115 | 0.050 | 0.030 | 0.030 |
| 2007 | 0.100 | 0.060 | 0.140 | 0.077 | 0.045 | 0.045 |
| 2008 | 0.088 | 0.034 | 0.125 | 0.047 | 0.025 | 0.025 |
| 2009 | 0.095 | 0.048 | 0.139 | 0.063 | 0.035 | 0.035 |
| 2010 | 0.095 | 0.028 | 0.141 | 0.038 | 0.020 | 0.020 |
| 2011 | 0.088 | 0.030 | 0.130 | 0.039 | 0.022 | 0.022 |
| 2012 | 0.089 | 0.036 | 0.137 | 0.049 | 0.025 | 0.025 |
| 2013 | 0.103 | 0.057 | 0.158 | 0.079 | 0.038 | 0.038 |
| 2014 | 0.101 | 0.049 | 0.162 | 0.070 | 0.031 | 0.031 |
Table 2 shows a slight overall increase in mean diabetes prevalence from 2005 (9.1%) to 2014 (10.1%), although the yearly values fluctuate. Within this period, 2013 had the highest mean diabetes prevalence at 10.3%, and 2007 had the highest standard deviation (0.060).
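The code that builds `data_summary` is not shown in this report. As a rough sketch, a yearly summary with the same column names as Table 2 could be produced with `group_by()` and `summarise()`. The toy data frame `toy_data` and its values are hypothetical and stand in for the reduced dataset.

```r
library(dplyr)

# Toy stand-in for the reduced dataset (made-up values).
toy_data <- tibble(
  Year = c(2013, 2013, 2014, 2014),
  diabetes_prevalence = c(0.08, 0.12, 0.09, 0.11),
  lower_95 = c(0.04, 0.06, 0.05, 0.07),
  upper_95 = c(0.12, 0.18, 0.13, 0.17)
)

# One row per year, with the mean and standard deviation of each measure.
toy_summary <- toy_data %>%
  group_by(Year) %>%
  summarise(
    mean_diabetes = mean(diabetes_prevalence),
    sd_diabetes = sd(diabetes_prevalence),
    mean_upper95 = mean(upper_95),
    sd_upper95 = sd(upper_95),
    mean_lower95 = mean(lower_95),
    sd_lower95 = sd(lower_95)
  )
```

Writing each statistic out explicitly makes it easy to check that every `sd_` column is computed from the same variable as its matching `mean_` column.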
A figure was created using the ggplot2 R package with geom_point(), geom_smooth() and geom_errorbar(). This is displayed in Figure 2:
Figure_2 <- ggplot(data = data_summary, aes(x = Year, y = mean_diabetes)) +
geom_point(alpha = 0.7) +
labs(title = "Figure 2: Mean Diabetes Prevalence Increases Over Time",
caption = "`geom_smooth()` using method = 'loess' and formula = 'y ~ x'",
subtitle = "Red Bars Represent Standard Deviation") +
xlab("Year") +
ylab("Mean Diabetes Prevalence") +
theme_minimal() +
geom_smooth() +
geom_errorbar(aes(ymin=mean_diabetes-sd_diabetes, ymax=mean_diabetes+sd_diabetes), colour="red", alpha=0.3)
ggplotly(Figure_2)
Australia_summary <- data_full %>%
filter(`Country/Region/World` == "Australia")
Figure_3 <- ggplot(data = Australia_summary, aes(x = Year, y = `Age-standardised diabetes prevalence`, col = Sex)) +
geom_point(alpha = 0.8) +
labs(title = "Figure 3: Men have Higher Risk of Diabetes",
caption = "`geom_smooth()` using method = 'loess' and formula = 'y ~ x'",
subtitle = "Diabetes Prevalence Has Increased Over Time") +
xlab("Year") +
ylab("Age-standardised Diabetes Prevalence") +
theme_minimal() +
geom_smooth()
Figure_3
Figure 3 shows a trend of increasing age-standardised diabetes prevalence in Australia over time. Men have a noticeably higher prevalence than women. There is a steep increase from 1980 to 2000, followed by a plateau. The dataset ends in 2014, so it is unknown whether the plateau later begins to trend downwards.
Australia_table_summary <- data_full %>%
filter(`Country/Region/World` %in% c("Australia", "Germany", "China", "South Africa", "United States of America")) %>%
select(-ISO) %>%
group_by(`Country/Region/World`, Sex) %>%
summarise(`Mean diabetes prevalence` = mean(`Age-standardised diabetes prevalence`))
kable(Australia_table_summary,
caption = "Table 3: Mean Diabetes Prevalence by Country and Sex",
digits = 3)
| Country/Region/World | Sex | Mean diabetes prevalence |
|---|---|---|
| Australia | Men | 0.064 |
| Australia | Women | 0.047 |
| China | Men | 0.060 |
| China | Women | 0.061 |
| Germany | Men | 0.056 |
| Germany | Women | 0.040 |
| South Africa | Men | 0.069 |
| South Africa | Women | 0.097 |
| United States of America | Men | 0.065 |
| United States of America | Women | 0.054 |
Five countries were selected at random to compare mean diabetes prevalence by country and sex. Table 3 shows that in Australia, Germany and the United States of America, men have a higher mean diabetes prevalence than women. Mean diabetes prevalence for men and women in China is very similar, with women being 0.001 higher. Interestingly, women in South Africa have a markedly higher mean diabetes prevalence than men (0.097 compared with 0.069).
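To make the sex comparison explicit, the gap between men and women can be calculated for each country. The sketch below re-enters the rounded values from Table 3 by hand and assumes tidyr's `pivot_wider()` to place the two sexes side by side. The names `table_3`, `sex_gap` and `men_minus_women` are illustrative, and because the inputs are rounded to three decimals the differences are approximate.

```r
library(dplyr)
library(tidyr)

# Rounded values copied from Table 3.
table_3 <- tibble(
  country = rep(c("Australia", "China", "Germany", "South Africa",
                  "United States of America"), each = 2),
  Sex = rep(c("Men", "Women"), times = 5),
  mean_prevalence = c(0.064, 0.047, 0.060, 0.061, 0.056,
                      0.040, 0.069, 0.097, 0.065, 0.054)
)

# One row per country, one column per sex, then the difference (men minus women).
sex_gap <- table_3 %>%
  pivot_wider(names_from = Sex, values_from = mean_prevalence) %>%
  mutate(men_minus_women = Men - Women) %>%
  arrange(men_minus_women)
```

A negative `men_minus_women` value marks a country where women have the higher mean prevalence, which in this table applies to South Africa and, marginally, China.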